In [ ]:
 
  • Download data file movie_metadata.csv
  • Get additional data from other sources if required.
  • Perform Data Preprocessing and Exploratory Data Analysis which includes data visualization also.
  • Create at least 3 different machine learning models to predict IMDB rating of a movie.
  • Compare the results and suggest the model which could be useful to deploy into production.
  • Optional: you can also use TensorFlow, Keras or Pytorch to build the models.
In [1]:
#import the created functions package
import functions
In [2]:
# Importing neccessary Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import mode
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MultiLabelBinarizer
import plotly.express as px
import plotly.graph_objects as go

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

%matplotlib inline
In [3]:
from mpl_toolkits import mplot3d

import missingno as msno ##to visualize the missig values

Load the data and quick view

In [4]:
##load the data in DataFrame
url = 'https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv'
df = pd.read_csv(url, index_col=0)
In [5]:
##quick view of the dataset
df.head(3)
Out[5]:
director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
color
Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy Johnny Depp ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller Christoph Waltz ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000

3 rows × 27 columns

In [6]:
##get the number of rows and columns
display(df.shape)

# no of rows and columns present in the airbnb data 
df_rows = df.shape[0]
df_columns = df.shape[1]
print('There are {} rows and {} columns in the imdb ratings dataset.'.format(df_rows,df_columns))
(5043, 27)
There are 5043 rows and 27 columns in the imdb ratings dataset.
In [7]:
##column names
df.columns
Out[7]:
Index(['director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
In [ ]:
 

Data cleaning and prep

In [8]:
### reset the index so that color can become a feature too
df2 = df.reset_index()
In [9]:
#quick check
df2.head(1)
Out[9]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000

1 rows × 28 columns

In [10]:
df2.shape
Out[10]:
(5043, 28)

Quick look at the completeness of the initial data

In [11]:
#Number of unique movie title
unique_val = df2.movie_title.nunique()

print('There are {} unique values in imdb ratings dataset.'.format(unique_val))
There are 4917 unique values in imdb ratings dataset.
In [12]:
# Number of duplicate rows based on movie title
duplicates = df2.movie_title.duplicated().sum()
print('There are {} duplicate values in imdb ratings dataset.'.format(duplicates))
There are 126 duplicate values in imdb ratings dataset.
In [13]:
## Drop the duplicate values based on movie-title 
#df2 = df2.movie_title.drop_duplicates()
df2 = df2.drop_duplicates(subset='movie_title')
In [14]:
##count of movies after dropping the duplicates
df2.shape[0]
Out[14]:
4917
In [15]:
df2.head(2)
Out[15]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0

2 rows × 28 columns

In [ ]:

In [16]:
#while inspecting through  other sources the title had 'Â' at the end so we will remove them if any exists

df2.movie_title=df2.movie_title.str.strip() 
df2.movie_title=df2.movie_title.replace('Â', '', regex=True)
In [17]:
df2.movie_title;
In [18]:
##count of non-null/NA values in each columns
df2.count()
Out[18]:
color                        4898
director_name                4815
num_critic_for_reviews       4868
duration                     4902
director_facebook_likes      4815
actor_3_facebook_likes       4894
actor_2_name                 4904
actor_1_facebook_likes       4910
gross                        4054
genres                       4917
actor_1_name                 4910
movie_title                  4917
num_voted_users              4917
cast_total_facebook_likes    4917
actor_3_name                 4894
facenumber_in_poster         4904
plot_keywords                4765
movie_imdb_link              4917
num_user_for_reviews         4896
language                     4905
country                      4912
content_rating               4617
budget                       4433
title_year                   4811
actor_2_facebook_likes       4904
imdb_score                   4917
aspect_ratio                 4591
movie_facebook_likes         4917
dtype: int64

Visualizing the completeness of initial data

In [19]:
# Visualize the completeness/ missing values, the bar graphs shows how complete the data is
#Higher the bar, higher the data completeness and lower the NULL values and vice versa
# values as a bar chart 
msno.bar(df2) 
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d23bf3ec8>

Missing/NA values

In [20]:
#using the missing_value function saved under functions module
#missing_value functions takes in a dataframe and gives a table and plot of the total nulls and percentage of nulls for each columns in the dataframe
functions.missing_value(df2)
null_percentage total_null_values
color 0.39 19
director_name 2.07 102
num_critic_for_reviews 1.00 49
duration 0.31 15
director_facebook_likes 2.07 102
actor_3_facebook_likes 0.47 23
actor_2_name 0.26 13
actor_1_facebook_likes 0.14 7
gross 17.55 863
genres 0.00 0
actor_1_name 0.14 7
movie_title 0.00 0
num_voted_users 0.00 0
cast_total_facebook_likes 0.00 0
actor_3_name 0.47 23
facenumber_in_poster 0.26 13
plot_keywords 3.09 152
movie_imdb_link 0.00 0
num_user_for_reviews 0.43 21
language 0.24 12
country 0.10 5
content_rating 6.10 300
budget 9.84 484
title_year 2.16 106
actor_2_facebook_likes 0.26 13
imdb_score 0.00 0
aspect_ratio 6.63 326
movie_facebook_likes 0.00 0

We can see from the bar graph the table above that 'gross' and 'budget' have highest missing values with 17.5% and 9.8% null values respectively. One approach would be to drop those columns but they don't have extremely high missing values and also they seem to be important features and the budget of the film suggest the quality of actors and movies to certain extent and also the gross earning of a movie is generally high if the viewers like the movie and they both are generally reflected on IMDB scores. Therefore, it is in our interest to keep these features for further analysis. Therefore, we will only delete rows with null values for 'gross' and 'budget' because imputation will not be a good approach here. We may revisit this later and create another model to replace these missing values if needed. </font>

.

.

In [21]:
##removing the null values from features 'gross' and 'budget'
df3 = df2[df2['gross'].notna() & df2['budget'].notna()]
df3.shape
Out[21]:
(3789, 28)
In [22]:
remaining_pct = ((df2.shape[0]-df3.shape[0])/df2.shape[0])*100
print('We have removed {} percentage of values from the previous imdb ratings dataset.'.format(remaining_pct))
We have removed 22.940817571690054 percentage of values from the previous imdb ratings dataset.

We Still have 3789 rows in our dataset which is good data size

Further data cleaning

In [23]:
##inspectinge genre
df3.genres.head(3)
Out[23]:
0    Action|Adventure|Fantasy|Sci-Fi
1           Action|Adventure|Fantasy
2          Action|Adventure|Thriller
Name: genres, dtype: object

The genres contains multiple genre for the same movie. Upon intuition, it feels like genre may be important factor for IMDB score rating. So we ill first inspect if the intuition is true or not. To do this, we will create a on-hot-encoded table for all genres.

In [24]:
#creating a genre dataframe
genre_temp = df3[['genres','imdb_score']]
In [25]:
##get the list of unique genres
genrelist = genre_temp.genres.str.split('|') ##splitting the genre column
genrelist;
flattened_list = [i for x in genrelist for i in x] ##creating a flattened version of list of list
genres = list(set(flattened_list))   ##getting thee unique values from list
genres;
In [26]:
genrelist
Out[26]:
0            [Action, Adventure, Fantasy, Sci-Fi]
1                    [Action, Adventure, Fantasy]
2                   [Action, Adventure, Thriller]
3                              [Action, Thriller]
5                     [Action, Adventure, Sci-Fi]
                          ...                    
5033                    [Drama, Sci-Fi, Thriller]
5034                                   [Thriller]
5035    [Action, Crime, Drama, Romance, Thriller]
5037                              [Comedy, Drama]
5042                                [Documentary]
Name: genres, Length: 3789, dtype: object
In [27]:
##replacing the genre value with list 
genre_temp.genres=genre_temp.genres.str.split('|').copy() 

#reset the index so we can have matching index on which we can merge dataframe
genre_temp= genre_temp.reset_index() #resets the index and creates a seperate column naed index for the old index
genre_temp =genre_temp.drop(columns=['index']) #we drop the old index as we dont need it 
genre_temp;
display(genre_temp.head(2)) ##quick view
genre_temp.shape ## check the shape to ensure the number of row matches with the previous dataframes rows
2020-03-23 21:02:01,437 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\pandas\core\generic.py:5208: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


genres imdb_score
0 [Action, Adventure, Fantasy, Sci-Fi] 7.9
1 [Action, Adventure, Fantasy] 7.1
Out[27]:
(3789, 2)

Converting Categorical genre to one hot encoding

In [28]:
##one hot encoding
mlb = MultiLabelBinarizer() ##suing sklearns
genre_onehot = pd.DataFrame(mlb.fit_transform(genre_temp.genres), columns = mlb.classes_)
display(genre_onehot.head(2))#quick view
genre_onehot.shape## check the shape to ensure the number of row matches with the previous dataframes from which we created this
Action Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy ... Music Musical Mystery Romance Sci-Fi Short Sport Thriller War Western
0 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 23 columns

Out[28]:
(3789, 23)
In [29]:
# # Merge two Dataframes on index of both the dataframes
# mergedDf = genre_temp.merge(genre_onehot, left_index=True, right_index=True)
# mergedDf.astype('int32').dtypes
# display(mergedDf)
# display(mergedDf.shape)
# mergedDf.astype({'imdb_score': 'int32'}).dtypes;
In [30]:
##creating a dataset with the imdb rating to get a mean imdb score for each genre
xyz = genre_onehot.multiply(genre_temp['imdb_score'], axis = 'index')
display(xyz.head(2))
xyz.shape
Action Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy ... Music Musical Mystery Romance Sci-Fi Short Sport Thriller War Western
0 7.9 7.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.9 ... 0.0 0.0 0.0 0.0 7.9 0.0 0.0 0.0 0.0 0.0
1 7.1 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 23 columns

Out[30]:
(3789, 23)
In [31]:
#replacing the 0s with NaNs so that we can get a mean score for for each genre
xyz2 =xyz.replace(0, np.NaN)
#getting the mean score for each genre
means = xyz2.mean(axis = 0)
means;
In [32]:
#means2 = means
#means.plot.bar()
In [33]:
##converting the series to datafram so we can get better plots using plotly
##resetting the index which the series dataframe has genre as the index and we need genre to plot
means = means.to_frame().reset_index() 

#renaming the columns to better names
means = means.rename(columns={'index': "Genre", 0: "Avg_IMDB_rating"})
means.head(2)
Out[33]:
Genre Avg_IMDB_rating
0 Action 6.285989
1 Adventure 6.454961
In [34]:
##using plotly to generate the bar plots
fig = px.bar(means, x = "Genre", y = "Avg_IMDB_rating" )
fig.show()

We can see that maximum of the genre have similar score that ranges between 6.2 to 7.2 so GENRE may not be a good feature for us as it will not provide any useful insight. So, we will be dropping genre for the initial model and maybe revisit later if needed

In [35]:
##checking for complete cases i.e number of rows without any missing value/ NaNs in any of the columns
complete_cases = df3[~df3.isna().any(axis=1)].shape[0]
print('We have {} complete cases i.e number of rows without any missing value/ NaNs in any of the columns'.format(complete_cases))
print('We still have {} rows with  missing value in any of the columns'.format(df3.shape[0]-complete_cases))
We have 3655 complete cases i.e number of rows without any missing value/ NaNs in any of the columns
We still have 134 rows with  missing value in any of the columns

.

Again, looking at the ratio of missing values

In [36]:
#using the missing_value function saved under functions module
#missing_value functions takes in a dataframe and gives a table and plot of the total nulls and percentage of nulls for each columns in the dataframe
functions.missing_value(df3)
null_percentage total_null_values
color 0.05 2
director_name 0.00 0
num_critic_for_reviews 0.03 1
duration 0.03 1
director_facebook_likes 0.00 0
actor_3_facebook_likes 0.26 10
actor_2_name 0.13 5
actor_1_facebook_likes 0.08 3
gross 0.00 0
genres 0.00 0
actor_1_name 0.08 3
movie_title 0.00 0
num_voted_users 0.00 0
cast_total_facebook_likes 0.00 0
actor_3_name 0.26 10
facenumber_in_poster 0.16 6
plot_keywords 0.82 31
movie_imdb_link 0.00 0
num_user_for_reviews 0.00 0
language 0.08 3
country 0.00 0
content_rating 1.35 51
budget 0.00 0
title_year 0.00 0
actor_2_facebook_likes 0.13 5
imdb_score 0.00 0
aspect_ratio 1.95 74
movie_facebook_likes 0.00 0

We can see that feature 'aspect_ratio' has the highest missing values which is 74 values which constitues of 1.95% percentage of the data. Before, we impute the data we want to inspect the feature

In [37]:
##inspecting aspect_ratio
display(df3.aspect_ratio.unique())
df3.aspect_ratio.value_counts()
array([ 1.78,  2.35,  1.85,  2.  ,  2.2 ,  2.39,  2.24,  1.66,  1.5 ,
        1.77,  2.4 ,  1.37,   nan,  2.76,  1.33,  1.18,  2.55,  1.75,
       16.  ])
Out[37]:
2.35     1946
1.85     1581
1.37       50
1.78       41
1.66       40
1.33       19
2.39       11
2.20       10
2.40        3
2.76        3
2.00        3
1.75        2
2.24        1
1.18        1
2.55        1
1.77        1
16.00       1
1.50        1
Name: aspect_ratio, dtype: int64
  • We can see that the data is dominant with 2.35 and 1.85 aspect ratios. So we can group the rest into one category which not 2.35 and not 1.85 and inspect their mean score for further analysis
In [38]:
ar1 = np.mean(df3.imdb_score[df3.aspect_ratio == 1.85]) ##mean imdb_score for movies with aspect_ratio of 1.85
ar2 = np.mean(df3.imdb_score[df3.aspect_ratio == 2.35]) ##mean imdb_score for movies with aspect_ratio of 2.35
ar3 = np.mean(df3.imdb_score[(df3.aspect_ratio != 2.35) & (df3.aspect_ratio != 1.85)]) ##mean imdb_score for movies with remaining aspect_ratio 

print('The mean imdb_score for movies with aspect_ratio of 1.85 is {} '.format(ar1))
print('The mean imdb_score for movies with aspect_ratio of 2.35 is {} '.format(ar2))
print('The mean imdb_score for movies with aspect_ratio other than 1.85 and 2.35 is {}'.format(ar3))
The mean imdb_score for movies with aspect_ratio of 1.85 is 6.369576217583823 
The mean imdb_score for movies with aspect_ratio of 2.35 is 6.50786228160329 
The mean imdb_score for movies with aspect_ratio other than 1.85 and 2.35 is 6.672519083969459

From this we can see that the mean imdb_score for movies any aspect ratio is similar ranging from 6.3 to 6.6. Since, there is not much difference in the score for different aspect ratio, 'aspect_ratio' feature will not give us much informatin and can be dropped

By inspecting feature 'genre', we concluded that we will remove genre

We will also remove 'movie_imdb_link' as it does not give any information for our analysis

In [39]:
##Removing aspect_ratio and genres 
df4 = df3.drop(columns=['genres', 'aspect_ratio'])
df4 = df4.drop(columns=['movie_imdb_link'])
In [40]:
##once again inspecting for missing value from remaining data
#functions.missing_value(df4)
In [41]:
df4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3789 entries, 0 to 5042
Data columns (total 25 columns):
color                        3787 non-null object
director_name                3789 non-null object
num_critic_for_reviews       3788 non-null float64
duration                     3788 non-null float64
director_facebook_likes      3789 non-null float64
actor_3_facebook_likes       3779 non-null float64
actor_2_name                 3784 non-null object
actor_1_facebook_likes       3786 non-null float64
gross                        3789 non-null float64
actor_1_name                 3786 non-null object
movie_title                  3789 non-null object
num_voted_users              3789 non-null int64
cast_total_facebook_likes    3789 non-null int64
actor_3_name                 3779 non-null object
facenumber_in_poster         3783 non-null float64
plot_keywords                3758 non-null object
num_user_for_reviews         3789 non-null float64
language                     3786 non-null object
country                      3789 non-null object
content_rating               3738 non-null object
budget                       3789 non-null float64
title_year                   3789 non-null float64
actor_2_facebook_likes       3784 non-null float64
imdb_score                   3789 non-null float64
movie_facebook_likes         3789 non-null int64
dtypes: float64(12), int64(3), object(10)
memory usage: 769.6+ KB
In [42]:
# # to change use .astype() 
# #df2['country'] = df2.country.astype(float)
# #df2.astype({'country': 'float64'}).dtypes.copy()
# df2["country"] = df2.country.astype(float)
In [43]:
df4.describe()
Out[43]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score movie_facebook_likes
count 3788.000000 3788.000000 3789.000000 3779.000000 3786.000000 3.789000e+03 3.789000e+03 3789.000000 3783.000000 3789.000000 3.789000e+03 3789.000000 3784.000000 3789.000000 3789.000000
mean 160.677402 109.802798 786.665875 736.823763 7503.154517 5.020490e+07 1.014201e+05 11114.848509 1.385937 321.337820 4.151169e+07 2003.011085 1932.961681 6.461547 8904.882291
std 122.770780 22.760215 3038.447982 1822.996034 15451.381326 6.872488e+07 1.506911e+05 18913.388976 2.065335 402.809142 1.075150e+08 9.994079 4452.752306 1.057753 21188.159966
min 1.000000 34.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1920.000000 0.000000 1.600000 0.000000
25% 71.000000 95.000000 10.000000 182.000000 716.250000 6.543194e+06 1.665100e+04 1801.000000 0.000000 100.000000 9.500000e+06 1999.000000 358.000000 5.900000 0.000000
50% 131.000000 105.000000 58.000000 424.000000 1000.000000 2.720000e+07 4.961200e+04 3843.000000 1.000000 199.000000 2.400000e+07 2005.000000 658.500000 6.600000 186.000000
75% 217.250000 120.000000 221.000000 683.000000 12000.000000 6.495596e+07 1.207860e+05 15835.000000 2.000000 385.000000 5.000000e+07 2010.000000 970.000000 7.200000 10000.000000
max 813.000000 330.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 4.200000e+09 2016.000000 137000.000000 9.300000 349000.000000
In [44]:
##count of null values ins each columns
df4.isnull().sum()
Out[44]:
color                         2
director_name                 0
num_critic_for_reviews        1
duration                      1
director_facebook_likes       0
actor_3_facebook_likes       10
actor_2_name                  5
actor_1_facebook_likes        3
gross                         0
actor_1_name                  3
movie_title                   0
num_voted_users               0
cast_total_facebook_likes     0
actor_3_name                 10
facenumber_in_poster          6
plot_keywords                31
num_user_for_reviews          0
language                      3
country                       0
content_rating               51
budget                        0
title_year                    0
actor_2_facebook_likes        5
imdb_score                    0
movie_facebook_likes          0
dtype: int64

The rest of the data have quite less missing values and we can impute the data for those missing value so that we do not lose any more data points</note>

Imputing numeric values

  • First we will create a dataframe having only numeric columns with missing values
  • Then we will use MICE imputation to impute the missing values
  • The resulting dataset will be a numpy array which we will convert into a Dataframe
  • We will then update the original dataframe with the values in the imputed dataframe
In [45]:
##dcreating dataframe havinf numeric columns with missing values
df_numeric = df4[['num_critic_for_reviews','duration','actor_3_facebook_likes','actor_1_facebook_likes','facenumber_in_poster','actor_2_facebook_likes']]
df_numeric;
In [46]:
#count of missing values for eath column
df_numeric.isna().sum()
Out[46]:
num_critic_for_reviews     1
duration                   1
actor_3_facebook_likes    10
actor_1_facebook_likes     3
facenumber_in_poster       6
actor_2_facebook_likes     5
dtype: int64
In [47]:
#imputing using MICE (sklearn.impute contains IterativeImputer which uses MICE to impute)
mice_imputer= IterativeImputer()
imputed=mice_imputer.fit_transform(df_numeric)
In [48]:
#the sklearn iterativeimputer outputs the data in numpy array form 
imputed;
In [49]:
##converting the imputed numpy array into dataframe adn giving them column names in the order of dataframe which was used to impute

df_imputed=pd.DataFrame(imputed, columns=['num_critic_for_reviews','duration','actor_3_facebook_likes','actor_1_facebook_likes','facenumber_in_poster','actor_2_facebook_likes'])
In [50]:
df_imputed;
In [51]:
#check if the imputation worked or not-- there should be 0 for all numeric columns
df_imputed.isna().sum()
Out[51]:
num_critic_for_reviews    0
duration                  0
actor_3_facebook_likes    0
actor_1_facebook_likes    0
facenumber_in_poster      0
actor_2_facebook_likes    0
dtype: int64
In [52]:
#creating a copy of the dataframe and updating it with the imputed data-points
df5 = df4
df5 = df5.reset_index().drop(columns=['index'])
df5.update(df_imputed)
In [53]:
##cheking the count of missing data in the columns
df5.isna().sum()
Out[53]:
color                         2
director_name                 0
num_critic_for_reviews        0
duration                      0
director_facebook_likes       0
actor_3_facebook_likes        0
actor_2_name                  5
actor_1_facebook_likes        0
gross                         0
actor_1_name                  3
movie_title                   0
num_voted_users               0
cast_total_facebook_likes     0
actor_3_name                 10
facenumber_in_poster          0
plot_keywords                31
num_user_for_reviews          0
language                      3
country                       0
content_rating               51
budget                        0
title_year                    0
actor_2_facebook_likes        0
imdb_score                    0
movie_facebook_likes          0
dtype: int64
  • We can see that all of the numeric data are complete ine. there is no missing value for numeric data

inspecting categorical content_rating

In [54]:
##content_rating
df5.content_rating.unique()
Out[54]:
array(['PG-13', 'PG', 'G', 'R', 'Approved', 'NC-17', nan, 'X',
       'Not Rated', 'Unrated', 'M', 'GP', 'Passed'], dtype=object)
In [55]:
##removing the null values from features 'content_rating'
df6 = df5[df5['content_rating'].notna()]
df6.shape
Out[55]:
(3738, 25)
In [ ]:
 
In [56]:
df6['content_rating'].value_counts()
Out[56]:
R            1699
PG-13        1283
PG            561
G              91
Not Rated      42
Unrated        24
Approved       17
X               9
NC-17           6
Passed          3
M               2
GP              1
Name: content_rating, dtype: int64

According to the history of naming these different content ratings, we find,PG-13 = G, M = GP = PG, X = NC-17. We want to replace PG-13 with G, M and GP with PG, replace X with NC-17, because these two are what we use nowadays.

In [57]:
df6['content_rating'] = df6['content_rating'].replace('PG-13', 'G')
df6['content_rating'] = df6['content_rating'].replace(['M','GP'], 'G')
df6['content_rating'] = df6['content_rating'].replace('X', 'NC-17')
df6['content_rating'] = df6['content_rating'].replace(['Approved','Passed','Unrated','Not Rated'], 'R')
2020-03-23 21:02:43,605 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


2020-03-23 21:02:43,615 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


2020-03-23 21:02:43,625 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


2020-03-23 21:02:43,635 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [58]:
##checking after cleaning the column
display(df6['content_rating'].value_counts())

##plotting the data
df6['content_rating'].value_counts().plot.bar()
R        1785
G        1377
PG        561
NC-17      15
Name: content_rating, dtype: int64
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d27660208>
In [59]:
#Adding new columns for the purpose of EDA
In [60]:
#profilt column
df6['profit'] = df6.gross - df6.budget
         
2020-03-23 21:02:44,997 [12620] WARNING  py.warnings:110: [JupyterRequire] C:\Users\Consultant\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [61]:
#plotting the 'color' column
a=df6.color.value_counts()
display(a)
a.plot.bar()
Color               3614
 Black and White     122
Name: color, dtype: int64
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d276a6908>
In [62]:
color_p=(3614/(3614+122))*100
bw_p=(122/(3614+122))*100

print('The color column in dataset consists of {} percent color and {} percent black&white'.format(color_p, bw_p))
The color column in dataset consists of 96.73447537473233 percent color and 3.2655246252676657 percent black&white

Since, the color column is extremely skedwed towards color this feature will not provide much help in our model. Thus, we will drop color column

In [ ]:
 
In [63]:
#plotting the 'language' column
lang= df6.language.value_counts()
display(lang)
lang.plot.bar()
English       3577
French          34
Spanish         24
Mandarin        14
German          11
Japanese        10
Italian          7
Cantonese        7
Portuguese       5
Hindi            5
Korean           4
Norwegian        4
Thai             3
Dutch            3
Danish           3
Persian          3
Hebrew           2
Dari             2
Indonesian       2
Aboriginal       2
Maya             1
None             1
Romanian         1
Russian          1
Hungarian        1
Kazakh           1
Filipino         1
Mongolian        1
Aramaic          1
Czech            1
Bosnian          1
Arabic           1
Zulu             1
Vietnamese       1
Name: language, dtype: int64
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d27733988>

We can see that for the 'language' column as well the data is extremely skewed and almost all the movies in the dataset has English as the language. This feature also will not provide much help in our model therefore we will drop language column as well

In [64]:
##Removing 'language' and 'color' features
##removing the null values from features 'content_rating'
df7 = df6.drop(columns=['color','language'])
df7.shape
Out[64]:
(3738, 24)
In [65]:
df7.columns;
In [66]:
#plotting the 'country' column
c= df7.country.value_counts()
display(c)
c.plot.bar()
USA               2971
UK                 309
France             103
Germany             77
Canada              63
Australia           38
Spain               22
Japan               15
Hong Kong           13
China               13
Italy               11
Mexico              10
Denmark              9
New Zealand          9
South Korea          7
Ireland              7
India                5
Brazil               5
Iran                 4
Norway               4
Thailand             4
Czech Republic       3
South Africa         3
Netherlands          3
Argentina            3
Russia               3
Israel               2
Hungary              2
Taiwan               2
Romania              2
Georgia              1
West Germany         1
Aruba                1
Chile                1
Greece               1
New Line             1
Poland               1
Philippines          1
Iceland              1
Finland              1
Colombia             1
Official site        1
Belgium              1
Peru                 1
Indonesia            1
Afghanistan          1
Name: country, dtype: int64
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d2784b8c8>

We can see that majority of the country is USA and then the second is UK followed by France, Germany and Canada. Since all other have less that 50 value we will group them all as 'Other'

In [67]:
#df7 = df6 ##making a copy of dataframe
##replacing all other countries except the specified ones as Other
df7.loc[~df7["country"].isin(['USA','UK','France','Germany','Canada']), "country"] = "Other"

#plotting the 'country' column
cc= df7.country.value_counts()
display(cc)
cc.plot.bar()
USA        2971
UK          309
Other       215
France      103
Germany      77
Canada       63
Name: country, dtype: int64
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d27855288>

Data Visualization

  • The following charts shows some of the visualization of data
  • While hovering over the chart, most of the charts dispaly vital information
In [68]:
# fig = px.histogram(df7, x="title_year")

# fig.update_layout(
    
    
#     title_text='Movies release over the years', # title of plot
#     xaxis_title_text='title_year', # xaxis label
#     yaxis_title_text='Count', # yaxis label
#     bargap=0.1, # gap between bars of adjacent location coordinates
#     #bargroupgap=0.1 # gap between bars of the same location coordinates
# )

# fig.show()

Movies released over year

In [69]:
fig = go.Figure(data=[go.Histogram(x=df7.title_year)])

fig.update_layout(
    title={
        'text': "Movies release over the years",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title_text='Year', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
)

fig.update_traces(marker_color='rgb(128,0,128)',opacity=0.4)

fig.show()
  • We can see that the data is highly skewed to the right.
  • We can also see that most of the movies inf the dataset were released after 1980s
  • Therefore, we can take only data points for our analysis
In [70]:
##Taking data for movies replease on or after 1980
df8 = df7[df7.title_year >= 1980]
In [71]:
#df8 = df7

#df8 = df8[df8.title_year >= 1980]

#df7.shape

#df8.shape
df8.columns;

Top Profit Earning Movies

In [72]:
##we will be looking at top 15 movies 
ccc = df8.nlargest(15, ['profit'])

##using plotly to generate the bar plots
# fig = px.bar(ccc, x = "movie_title", y = "profit" )
# fig.show()
fig = px.bar(ccc, x='movie_title', y='profit',
             hover_data=['director_name', 'title_year', 'budget', 'gross','actor_1_name'], #color='lifeExp',
             labels={'Movies by profit'}, height=600)


# fig = go.Figure(data=[go.Bar(x=ccc)])

fig.update_layout(
    title={
        'text': "Movies by profit",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title_text='Movie title', # xaxis label
    yaxis_title_text='profit', # yaxis label
    #bargap=0.2, # gap between bars of adjacent location coordinates
)

fig.update_traces(marker_color='rgb(128,128,128)',opacity=0.75)

display(ccc[['movie_title','title_year']])
fig.show()
movie_title title_year
0 Avatar 2009.0
28 Jurassic World 2015.0
25 Titanic 1997.0
2723 E.T. the Extra-Terrestrial 1982.0
16 The Avengers 2012.0
482 The Lion King 1994.0
230 Star Wars: Episode I - The Phantom Menace 1999.0
64 The Dark Knight 2008.0
419 The Hunger Games 2012.0
767 Deadpool 2016.0
180 The Hunger Games: Catching Fire 2013.0
659 Jurassic Park 1993.0
494 Despicable Me 2 2013.0
769 American Sniper 2014.0
324 Finding Nemo 2003.0

Top Grossing Movies

In [73]:
##we will be looking at top 15 movies 
ccc = df8.nlargest(15, ['gross'])


##using plotly to generate the bar plots
# fig = px.bar(ccc, x = "movie_title", y = "profit" )
# fig.show()
fig = px.bar(ccc, x='movie_title', y='gross',
             hover_data=['director_name','title_year', 'budget', 'gross','actor_1_name'], #color='lifeExp',
             labels={'Movies by Gross earning'}, height=650)


# fig = go.Figure(data=[go.Bar(x=ccc)])

fig.update_layout(
    title={
        'text': "Movies by Gross earning",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title_text='Movie title', # xaxis label
    yaxis_title_text='Gross Earning', # yaxis label
    #bargap=0.2, # gap between bars of adjacent location coordinates
)

fig.update_traces(marker_color='rgb(66,12,9)',opacity=0.75)

display(ccc[['movie_title','gross']])
fig.show()
movie_title gross
0 Avatar 760505847.0
25 Titanic 658672302.0
28 Jurassic World 652177271.0
16 The Avengers 623279547.0
64 The Dark Knight 533316061.0
230 Star Wars: Episode I - The Phantom Menace 474544677.0
7 Avengers: Age of Ultron 458991599.0
3 The Dark Knight Rises 448130642.0
552 Shrek 2 436471036.0
2723 E.T. the Extra-Terrestrial 434949459.0
180 The Hunger Games: Catching Fire 424645577.0
12 Pirates of the Caribbean: Dead Man's Chest 423032628.0
482 The Lion King 422783777.0
42 Toy Story 3 414984497.0
31 Iron Man 3 408992272.0

Highest Budget Movies

In [74]:
##we will be looking at top 15 movies 
ccc = df8.nlargest(15, ['budget'])


##using plotly to generate the bar plots
# fig = px.bar(ccc, x = "movie_title", y = "profit" )
# fig.show()
fig = px.bar(ccc, x='movie_title', y='budget',
             hover_data=['director_name', 'budget', 'gross','actor_1_name'], #color='lifeExp',
             labels={'Movies by Budget'}, height=650)


# fig = go.Figure(data=[go.Bar(x=ccc)])

fig.update_layout(
    title={
        'text': "Movies by Budget",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title_text='Movie title', # xaxis label
    yaxis_title_text='Movie budget', # yaxis label
    #bargap=0.2, # gap between bars of adjacent location coordinates
)

fig.update_traces(marker_color='rgb(106,13,73)',opacity=0.75)

display(ccc[['movie_title','budget']])
fig.show()
movie_title budget
3247 Lady Vengeance 4.200000e+09
2670 Fateless 2.500000e+09
2122 Princess Mononoke 2.400000e+09
2133 Steamboy 2.127520e+09
2961 Akira 1.100000e+09
3577 Godzilla 2000 1.000000e+09
2718 Kabhi Alvida Naa Kehna 7.000000e+08
3242 Tango 7.000000e+08
1255 Red Cliff 5.536320e+08
2889 The Legend of Suriyothai 4.000000e+08
958 The Messenger: The Story of Joan of Arc 3.900000e+08
1 Pirates of the Caribbean: At World's End 3.000000e+08
2459 Ong-bak 2 3.000000e+08
4 John Carter 2.637000e+08
6 Tangled 2.600000e+08
In [ ]:
 

Relation between highest score, budget and country

In [75]:
ccc = df8.nlargest(35, ['imdb_score'])

df = ccc
fig = px.scatter(df, x="imdb_score", y="budget", facet_col="country", color="movie_title")
fig.show()
  • We can see that the imdb_score for high budget movies are not necessarily high
  • Most of the top 35 imdb_score movies in USA or any other countries have budget less that 100M
  • The movie with highest imdb_score in USA budget less than $50M

Relation between highest score, gross earning and country

In [76]:
ccc = df8.nlargest(35, ['imdb_score'])

df = ccc
fig = px.scatter(df, x="imdb_score", y="gross", facet_col="country", color="movie_title")
fig.show()
  • We can see that the having high imdb_score does not necessarily mean more people will watch it and their gross earning is higher
  • Imdb_score seem to be a good indicator of the earning of the movies only to certain extent
  • The movie with highest imdb_score in USA gross earning less than $50M

Visualizing relation between imdb_score and gross earning

For top 200 movies with higest score

In [77]:
ccc = df8.nlargest(200, ['imdb_score'])

df = ccc
fig = px.scatter(df, x="imdb_score", y="gross", color="movie_title")
fig.show()
  • For the movies having high imdb score there does not seem to be strong relation between imdb score and gross earning.

For bottom 200 movies with higest score

In [78]:
ccc = df8.nsmallest(300, ['imdb_score'])

df = ccc
fig = px.scatter(df, x="imdb_score", y="gross", color="movie_title")
fig.show()
  • For the movies having low imdb score there seems to be trend between imdb score and gross earning.

Relationship between movie_facebook_likes and imdb_score

In [79]:
ccc = df8
df = ccc
fig = px.scatter(df, x="imdb_score", y="movie_facebook_likes", color="content_rating")
fig.show()
In [ ]:
 

Critically Acclaimed vs Commercially Successful

In [80]:
ccc = df8.nlargest(200, ['profit'])



df = ccc
fig = px.scatter(df, x="imdb_score", y="gross",size ="profit" , color="content_rating", hover_data=["movie_title"])



fig.add_shape(
        # Line Horizontal
            type="line",
            x0=7,
            y0=0,
            x1=7,
            y1=900000000,
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )    


fig.add_shape(
        # Line Horizontal
            type="line",
            x0=4,
            y0=400000000,
            x1=9.5,
            y1=400000000,
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )    
    

    
fig.show()
  • Movies in the 4 th and 1st quadrant are commercially succesfull movies whereas movies on the 1st and 2ndd quadrant are critically acclamed movies.
  • Movies in the 1st quadrant are both critically acclaimed as well as commercially succesful movies
  • Movies in the 3rd quadrant are flop movies

Data Preprocessing

Inspecting names and removing if unnecessary

unique director names

df8.director_name.nunique()

In [81]:
##unique actor names in each category
df8[['actor_1_name','actor_2_name','actor_3_name']].nunique()
Out[81]:
actor_1_name    1422
actor_2_name    2167
actor_3_name    2575
dtype: int64
In [82]:
## unique actor names combined
df8.groupby(['actor_1_name','actor_2_name','actor_3_name']).ngroups
Out[82]:
3615
  • One way of utilising categorical variables is to one-hot-encode and create each of the categorical variable into a column. But we have mo many unique names tthet they will not provide any value being a predictor and also it is not a feasible solution here as we will end up getting huge unnecessary features. So, we will just drop them.
  • The same is the case with the plot keywords, so we will drop plot_keywords as well.
In [111]:
##Removing the above mentioned columns
df9 = df8.drop(columns=['director_name', 'actor_1_name','actor_2_name','actor_3_name','plot_keywords'])
In [ ]:
 
In [ ]:
 

Removing linearly dependent and highly correlated variables

earlier we introduced profit for EDA purpose. Since, profit is dependent to budget and gross earning we will remove them

In [112]:
df9 = df9.drop(columns=['profit'])
In [113]:
##Checking if there is any minning values
df9.isna().sum()
Out[113]:
num_critic_for_reviews       0
duration                     0
director_facebook_likes      0
actor_3_facebook_likes       0
actor_1_facebook_likes       0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
facenumber_in_poster         0
num_user_for_reviews         0
country                      0
content_rating               0
budget                       0
title_year                   0
actor_2_facebook_likes       0
imdb_score                   0
movie_facebook_likes         0
dtype: int64
In [ ]:
 
In [114]:
# Rescale the font size of the seaborn plots
sns.set(font_scale=1)
In [115]:
# Calculate the correlation between all the variables
c = df9.corr()
# Use heatmap to see the correlation between all the variables (including target variables)
plt.figure(figsize=(30,15))
sns.heatmap(c, annot=True)
Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d2bf018c8>
In [116]:
df9.columns
Out[116]:
Index(['num_critic_for_reviews', 'duration', 'director_facebook_likes',
       'actor_3_facebook_likes', 'actor_1_facebook_likes', 'gross',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'facenumber_in_poster', 'num_user_for_reviews', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'movie_facebook_likes'],
      dtype='object')
  • Since there is high correlation between actors facebook likes we will just drop them and use the total cast facebook likes
  • Also there is some correlation between num_user_for_reviews and num_critic_for_reviews so we will combine them by taking their ratio

Feature engineering and removing redundant features

In [117]:
df9['user_critic_ratio'] = df9['num_critic_for_reviews'] / df9['num_user_for_reviews'] 

df10 = df9.drop(columns=['movie_title','num_critic_for_reviews','num_user_for_reviews','actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes'])
In [118]:
df10.content_rating.value_counts()
Out[118]:
R        1735
G        1360
PG        537
NC-17      13
Name: content_rating, dtype: int64
In [119]:
#df# Convert country to one hot encoding
df10['USA'] = [1 if i == 'USA' else 0 for i in df10.country]
df10['UK'] = [1 if i == 'UK' else 0 for i in df10.country]
df10['France'] = [1 if i == 'France' else 0 for i in df10.country]
df10['Germany'] = [1 if i == 'Germany' else 0 for i in df10.country]
df10['Canada'] = [1 if i == 'Canada' else 0 for i in df10.country]
df10['Other'] = [1 if i == 'Other' else 0 for i in df10.country]

df10['R'] = [1 if i == 'R' else 0 for i in df10.country]
df10['G'] = [1 if i == 'G' else 0 for i in df10.country]
df10['PG'] = [1 if i == 'PG' else 0 for i in df10.country]
df10['NC-17'] = [1 if i == 'NC-17' else 0 for i in df10.country]

df10 = df10.drop(columns=['country','content_rating'])
In [120]:
df10;
In [121]:
# Calculate the correlation between all the variables
c = df10.corr()
# Use heatmap to see the correlation between all the variables (including target variables)
plt.figure(figsize=(30,15))
sns.heatmap(c, annot=True)
Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d27b516c8>
  • The features do not have high correlation i.e (correlation > 0.65)
In [122]:
# Rescale the font size of seaborn plots
sns.set(font_scale=2)
In [123]:
# Use pairplot to see how the data is distributed with each other
#sns.pairplot(df10[df10.columns.to_list()])
In [124]:
# dfx.columns
In [125]:
#df10.to_csv('df10.csv', index = False)

#dfx = pd.read_csv('df10.csv')
In [132]:
# Make a copy of final cleaned dataset
dfx = df10.copy()
In [133]:
# Import model from scikit learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
In [134]:
dfx.head(3)
Out[134]:
duration director_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster budget title_year imdb_score movie_facebook_likes ... USA UK France Germany Canada Other R G PG NC-17
0 178.0 0.0 760505847.0 886204 4834 0.0 237000000.0 2009.0 7.9 33000 ... 1 0 0 0 0 0 0 0 0 0
1 169.0 563.0 309404152.0 471220 48350 0.0 300000000.0 2007.0 7.1 0 ... 1 0 0 0 0 0 0 0 0 0
2 148.0 0.0 200074175.0 275868 11700 1.0 245000000.0 2015.0 6.8 85000 ... 0 1 0 0 0 0 0 0 0 0

3 rows × 21 columns

In [135]:
print(dfx.columns.tolist())
['duration', 'director_facebook_likes', 'gross', 'num_voted_users', 'cast_total_facebook_likes', 'facenumber_in_poster', 'budget', 'title_year', 'imdb_score', 'movie_facebook_likes', 'user_critic_ratio', 'USA', 'UK', 'France', 'Germany', 'Canada', 'Other', 'R', 'G', 'PG', 'NC-17']
In [136]:
dfx = dfx[['duration', 'gross', 'budget', 'num_voted_users', 'facenumber_in_poster','director_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes', 'user_critic_ratio','USA', 'UK', 'France', 'Germany', 'Canada', 'Other','R', 'G', 'PG', 'NC-17','imdb_score']]
dfx.head(3)
Out[136]:
duration gross budget num_voted_users facenumber_in_poster director_facebook_likes cast_total_facebook_likes movie_facebook_likes user_critic_ratio USA UK France Germany Canada Other R G PG NC-17 imdb_score
0 178.0 760505847.0 237000000.0 886204 0.0 0.0 4834 33000 0.236739 1 0 0 0 0 0 0 0 0 0 7.9
1 169.0 309404152.0 300000000.0 471220 0.0 563.0 48350 0 0.243942 1 0 0 0 0 0 0 0 0 0 7.1
2 148.0 200074175.0 245000000.0 275868 1.0 0.0 11700 85000 0.605634 0 1 0 0 0 0 0 0 0 0 6.8
In [137]:
# Normalization function
def normalization(df):
    return (df.values - df.values.min()) / (df.values.max() - df.values.min())
In [138]:
dfx['duration'] = normalization(dfx.duration)

dfx['gross'] = normalization(dfx.gross)

dfx['budget'] = normalization(dfx.budget)

dfx['num_voted_users'] = normalization(dfx.num_voted_users)

dfx['facenumber_in_poster'] = normalization(dfx.facenumber_in_poster)

dfx['director_facebook_likes'] = normalization(dfx.director_facebook_likes)

dfx['cast_total_facebook_likes'] = normalization(dfx.cast_total_facebook_likes)

dfx['movie_facebook_likes'] = normalization(dfx.movie_facebook_likes)

dfx['user_critic_ratio'] = normalization(dfx.user_critic_ratio)
In [139]:
dfx.imdb_score.head(23)
Out[139]:
0     7.9
1     7.1
2     6.8
3     8.5
4     6.6
5     6.2
6     7.8
7     7.5
8     7.5
9     6.9
10    6.1
11    6.7
12    7.3
13    6.5
14    7.2
15    6.6
16    8.1
17    6.7
18    6.8
19    7.5
20    7.0
21    6.7
22    7.9
Name: imdb_score, dtype: float64
In [ ]:
 
In [140]:
# Separate dependet variables and independent variables from the original dataset
X = dfx[dfx.columns[:-1]]
y = dfx[dfx.columns[-1]]
In [141]:
# Split dataset into train and test with a ratio 0.8 to 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [ ]:
 

Xgboost regressor

In [142]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
In [ ]:
 
In [143]:
data_dmatrix = xgb.DMatrix(data=X,label=y)
In [144]:
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.3,
                max_depth = 100, alpha = 15, n_estimators = 500)
In [145]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)
[21:08:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.0.0/src/objective/regression_obj.cu:167: reg:linear is now deprecated in favor of reg:squarederror.
In [146]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 0.710796
In [148]:
r2_score(y_test, preds)
Out[148]:
0.5395467522999379

Random forest regressor

In [149]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Initialize a Random Forest Model (hyperparameter is not fine tuned)
rf = RandomForestRegressor(n_estimators=500, max_depth=100, min_samples_leaf=5)

# Train the Model
rf.fit(X_train, y_train)

# Predict results on the test dataset
y_hat1 = rf.predict(X_test)
In [150]:
rmse = np.sqrt(mse(y_test, y_hat1))
rmse
Out[150]:
0.7038588925201631
In [151]:
r2_score(y_test, y_hat1)
Out[151]:
0.5484902516130051
In [ ]:
 

KNN- Regressor

In [152]:
#import required packages
from sklearn import neighbors
from sklearn.metrics import mean_squared_error 
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline
In [153]:
rmse_val = [] #to store rmse values for different k
for K in range(20):
    K = K+1
    model = neighbors.KNeighborsRegressor(n_neighbors = K)

    model.fit(X_train, y_train)  #fit the model
    pred=model.predict(X_test) #make prediction on test set
    error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
    rmse_val.append(error) #store rmse values
    print('RMSE value for k= ' , K , 'is:', error)
RMSE value for k=  1 is: 1.0815131908086022
RMSE value for k=  2 is: 0.9717864984801229
RMSE value for k=  3 is: 0.9354717810453421
RMSE value for k=  4 is: 0.9041987866703796
RMSE value for k=  5 is: 0.8815539963948157
RMSE value for k=  6 is: 0.8591590334622334
RMSE value for k=  7 is: 0.8554339879535672
RMSE value for k=  8 is: 0.8490220711891153
RMSE value for k=  9 is: 0.8447153401150415
RMSE value for k=  10 is: 0.8392126756382967
RMSE value for k=  11 is: 0.8384550292276498
RMSE value for k=  12 is: 0.8304852450615235
RMSE value for k=  13 is: 0.8286843316046625
RMSE value for k=  14 is: 0.8249146434487827
RMSE value for k=  15 is: 0.8244415604532876
RMSE value for k=  16 is: 0.8240690347848418
RMSE value for k=  17 is: 0.8255866742996189
RMSE value for k=  18 is: 0.822482060386058
RMSE value for k=  19 is: 0.8231751396571053
RMSE value for k=  20 is: 0.8242962455401108
In [154]:
#plotting the rmse values against k values
curve = pd.DataFrame(rmse_val) #elbow curve 
curve.plot()
Out[154]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d2aab6908>
In [155]:
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10,11,12,13]}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)
model.fit(X_train,y_train)
model.best_params_
Out[155]:
{'n_neighbors': 12}
In [156]:
model = neighbors.KNeighborsRegressor(n_neighbors = 12)

model.fit(X_train, y_train)  #fit the model
pred=model.predict(X_test) #make prediction on test set
error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
print('RMSE value for k= ' , K , 'is:', error)
RMSE value for k=  20 is: 0.8304852450615235
In [157]:
r2_score(y_test, pred)
Out[157]:
0.37142114204824506

ANN regression using Keras

In [158]:
X_train.shape
Out[158]:
(2916, 19)
In [173]:
import numpy as np
import pandas as pd
import keras
import keras.backend as kb
import tensorflow as tf

model = keras.Sequential([
    keras.layers.Dense(128, activation=tf.nn.relu, input_shape=[19]),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(1)
  ])
In [186]:
optimizer = tf.keras.optimizers.Adam(0.0099)
model.compile(loss='mean_squared_error',optimizer=optimizer)
model.fit(X_train,y_train,epochs=500)
Epoch 1/500
2916/2916 [==============================] - 1s 178us/step - loss: 0.4423
Epoch 2/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.4656
Epoch 3/500
2916/2916 [==============================] - ETA: 0s - loss: 0.439 - 0s 132us/step - loss: 0.4374
Epoch 4/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.4418
Epoch 5/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.4229 0s - loss: 0
Epoch 6/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.4224 0s - loss: 0
Epoch 7/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.4010
Epoch 8/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.4268
Epoch 9/500
2916/2916 [==============================] - 1s 172us/step - loss: 0.4211
Epoch 10/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.4098
Epoch 11/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4086
Epoch 12/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.4044
Epoch 13/500
2916/2916 [==============================] - 0s 153us/step - loss: 0.4096
Epoch 14/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.4414
Epoch 15/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.4284
Epoch 16/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.4163
Epoch 17/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.4096
Epoch 18/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4151
Epoch 19/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4162
Epoch 20/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4190
Epoch 21/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4402
Epoch 22/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4176
Epoch 23/500
2916/2916 [==============================] - 0s 159us/step - loss: 0.4119
Epoch 24/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.4144
Epoch 25/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3954
Epoch 26/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4156
Epoch 27/500
2916/2916 [==============================] - 0s 147us/step - loss: 0.4052
Epoch 28/500
2916/2916 [==============================] - 1s 206us/step - loss: 0.3954
Epoch 29/500
2916/2916 [==============================] - 0s 169us/step - loss: 0.4135
Epoch 30/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3978
Epoch 31/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.4329
Epoch 32/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.4329
Epoch 33/500
2916/2916 [==============================] - 0s 152us/step - loss: 0.4138
Epoch 34/500
2916/2916 [==============================] - 0s 159us/step - loss: 0.4242
Epoch 35/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.4015
Epoch 36/500
2916/2916 [==============================] - 1s 192us/step - loss: 0.4569
Epoch 37/500
2916/2916 [==============================] - 0s 161us/step - loss: 0.4164
Epoch 38/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.4061
Epoch 39/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.4065
Epoch 40/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4064
Epoch 41/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.4293
Epoch 42/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.4257
Epoch 43/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.3967
Epoch 44/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4047
Epoch 45/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3995
Epoch 46/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4063
Epoch 47/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3870
Epoch 48/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4070
Epoch 49/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4041
Epoch 50/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4041
Epoch 51/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.4169
Epoch 52/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.4054
Epoch 53/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.4141
Epoch 54/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4158
Epoch 55/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4178
Epoch 56/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.4200
Epoch 57/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.4450
Epoch 58/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3930
Epoch 59/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4125
Epoch 60/500
2916/2916 [==============================] - 0s 119us/step - loss: 0.4013
Epoch 61/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4264
Epoch 62/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4192
Epoch 63/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3980
Epoch 64/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4252
Epoch 65/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3992
Epoch 66/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.4089
Epoch 67/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.4166
Epoch 68/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.3972
Epoch 69/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3976
Epoch 70/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.4126
Epoch 71/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4045
Epoch 72/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.3932
Epoch 73/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.4159
Epoch 74/500
2916/2916 [==============================] - 0s 141us/step - loss: 0.4229
Epoch 75/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.4301
Epoch 76/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4170
Epoch 77/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4026
Epoch 78/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3973
Epoch 79/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3991
Epoch 80/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3987
Epoch 81/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.4246
Epoch 82/500
2916/2916 [==============================] - 0s 162us/step - loss: 0.3976
Epoch 83/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.4195
Epoch 84/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3993
Epoch 85/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3992
Epoch 86/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.4121
Epoch 87/500
2916/2916 [==============================] - 0s 141us/step - loss: 0.4117
Epoch 88/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.4319
Epoch 89/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4034
Epoch 90/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4039
Epoch 91/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3917
Epoch 92/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4124
Epoch 93/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.4115
Epoch 94/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.4048
Epoch 95/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4217
Epoch 96/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3943
Epoch 97/500
2916/2916 [==============================] - 0s 163us/step - loss: 0.3938
Epoch 98/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3986
Epoch 99/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3930
Epoch 100/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.3835
Epoch 101/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.4058
Epoch 102/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3961
Epoch 103/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4174
Epoch 104/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.4053
Epoch 105/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3977
Epoch 106/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4119
Epoch 107/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.4075
Epoch 108/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4223
Epoch 109/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4164
Epoch 110/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3868
Epoch 111/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.4024
Epoch 112/500
2916/2916 [==============================] - 1s 178us/step - loss: 0.4030
Epoch 113/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3905
Epoch 114/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.4115
Epoch 115/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3779
Epoch 116/500
2916/2916 [==============================] - 0s 150us/step - loss: 0.3918
Epoch 117/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3965
Epoch 118/500
2916/2916 [==============================] - 0s 151us/step - loss: 0.3979
Epoch 119/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.4041
Epoch 120/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.4149
Epoch 121/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3960
Epoch 122/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.4009
Epoch 123/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.4335
Epoch 124/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4212
Epoch 125/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3995
Epoch 126/500
2916/2916 [==============================] - 0s 169us/step - loss: 0.3953
Epoch 127/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3921
Epoch 128/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3890
Epoch 129/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3901
Epoch 130/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.3948
Epoch 131/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3927
Epoch 132/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.4067
Epoch 133/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.3910
Epoch 134/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3951
Epoch 135/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3965
Epoch 136/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3953
Epoch 137/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3846
Epoch 138/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3905
Epoch 139/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3948
Epoch 140/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.4043
Epoch 141/500
2916/2916 [==============================] - 0s 163us/step - loss: 0.4078
Epoch 142/500
2916/2916 [==============================] - 0s 148us/step - loss: 0.3960
Epoch 143/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3928
Epoch 144/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4045
Epoch 145/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3825
Epoch 146/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.4007
Epoch 147/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4426
Epoch 148/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4012
Epoch 149/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3990
Epoch 150/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3916
Epoch 151/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4023
Epoch 152/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3981
Epoch 153/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3908
Epoch 154/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3805
Epoch 155/500
2916/2916 [==============================] - 0s 120us/step - loss: 0.3976
Epoch 156/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.3871
Epoch 157/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.3975
Epoch 158/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3936
Epoch 159/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4107
Epoch 160/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.4049
Epoch 161/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.4832
Epoch 162/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4203
Epoch 163/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4092
Epoch 164/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3967
Epoch 165/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4026
Epoch 166/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4015
Epoch 167/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3899
Epoch 168/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3889
Epoch 169/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.4050
Epoch 170/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3922
Epoch 171/500
2916/2916 [==============================] - 1s 180us/step - loss: 0.4039
Epoch 172/500
2916/2916 [==============================] - 1s 181us/step - loss: 0.4042
Epoch 173/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3976
Epoch 174/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3838
Epoch 175/500
2916/2916 [==============================] - 1s 221us/step - loss: 0.4081
Epoch 176/500
2916/2916 [==============================] - 1s 219us/step - loss: 0.4109
Epoch 177/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3869
Epoch 178/500
2916/2916 [==============================] - 0s 115us/step - loss: 0.4170 0s - loss: 0.
Epoch 179/500
2916/2916 [==============================] - 0s 115us/step - loss: 0.3969
Epoch 180/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.4079
Epoch 181/500
2916/2916 [==============================] - 0s 113us/step - loss: 0.3865
Epoch 182/500
2916/2916 [==============================] - 0s 118us/step - loss: 0.3990
Epoch 183/500
2916/2916 [==============================] - 0s 117us/step - loss: 0.3930
Epoch 184/500
2916/2916 [==============================] - 0s 116us/step - loss: 0.4254 0s - loss: 
Epoch 185/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3974
Epoch 186/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3789
Epoch 187/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3927
Epoch 188/500
2916/2916 [==============================] - 0s 118us/step - loss: 0.3842
Epoch 189/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.4190
Epoch 190/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3911
Epoch 191/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4394
Epoch 192/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4121
Epoch 193/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3949
Epoch 194/500
2916/2916 [==============================] - 0s 152us/step - loss: 0.4023
Epoch 195/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3910
Epoch 196/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.3948
Epoch 197/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3986
Epoch 198/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3925
Epoch 199/500
2916/2916 [==============================] - 0s 158us/step - loss: 0.3940
Epoch 200/500
2916/2916 [==============================] - 1s 178us/step - loss: 0.3941
Epoch 201/500
2916/2916 [==============================] - 1s 195us/step - loss: 0.3761
Epoch 202/500
2916/2916 [==============================] - 0s 119us/step - loss: 0.4136
Epoch 203/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3956
Epoch 204/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3923
Epoch 205/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.4133
Epoch 206/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3998
Epoch 207/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.4064
Epoch 208/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3984
Epoch 209/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.3795
Epoch 210/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3841
Epoch 211/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3858
Epoch 212/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.4076
Epoch 213/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.3780
Epoch 214/500
2916/2916 [==============================] - 0s 165us/step - loss: 0.4558
Epoch 215/500
2916/2916 [==============================] - 1s 189us/step - loss: 0.3991
Epoch 216/500
2916/2916 [==============================] - 0s 150us/step - loss: 0.3924
Epoch 217/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3891
Epoch 218/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.3842
Epoch 219/500
2916/2916 [==============================] - 0s 163us/step - loss: 0.3671
Epoch 220/500
2916/2916 [==============================] - 0s 119us/step - loss: 0.3927
Epoch 221/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3955
Epoch 222/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.3974
Epoch 223/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3727
Epoch 224/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3905
Epoch 225/500
2916/2916 [==============================] - 0s 112us/step - loss: 0.3819
Epoch 226/500
2916/2916 [==============================] - 0s 112us/step - loss: 0.4034
Epoch 227/500
2916/2916 [==============================] - 1s 174us/step - loss: 0.3951
Epoch 228/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3939
Epoch 229/500
2916/2916 [==============================] - 1s 183us/step - loss: 0.4115 0s - loss:
Epoch 230/500
2916/2916 [==============================] - 0s 164us/step - loss: 0.3748
Epoch 231/500
2916/2916 [==============================] - 0s 153us/step - loss: 0.3875
Epoch 232/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.4101
Epoch 233/500
2916/2916 [==============================] - 0s 152us/step - loss: 0.3927
Epoch 234/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.4118 0s - loss:
Epoch 235/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3955
Epoch 236/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.3746
Epoch 237/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.4003
Epoch 238/500
2916/2916 [==============================] - 0s 149us/step - loss: 0.3884
Epoch 239/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3893
Epoch 240/500
2916/2916 [==============================] - 0s 161us/step - loss: 0.3783
Epoch 241/500
2916/2916 [==============================] - 0s 151us/step - loss: 0.3972
Epoch 242/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3972
Epoch 243/500
2916/2916 [==============================] - 1s 175us/step - loss: 0.3812
Epoch 244/500
2916/2916 [==============================] - 0s 165us/step - loss: 0.3934
Epoch 245/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.3854
Epoch 246/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3937
Epoch 247/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.4118
Epoch 248/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4014
Epoch 249/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3911
Epoch 250/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3884
Epoch 251/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3766
Epoch 252/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3894
Epoch 253/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3855
Epoch 254/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3781
Epoch 255/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3735
Epoch 256/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3726
Epoch 257/500
2916/2916 [==============================] - 0s 148us/step - loss: 0.3784
Epoch 258/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3899
Epoch 259/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.4113
Epoch 260/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3924
Epoch 261/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3928
Epoch 262/500
2916/2916 [==============================] - ETA: 0s - loss: 0.382 - 0s 136us/step - loss: 0.3820
Epoch 263/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3833
Epoch 264/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3711
Epoch 265/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3889
Epoch 266/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3907
Epoch 267/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3939
Epoch 268/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3866
Epoch 269/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3802
Epoch 270/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3822
Epoch 271/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3935
Epoch 272/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.4016
Epoch 273/500
2916/2916 [==============================] - 0s 164us/step - loss: 0.3779
Epoch 274/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3860
Epoch 275/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3809
Epoch 276/500
2916/2916 [==============================] - 0s 141us/step - loss: 0.3952
Epoch 277/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3908
Epoch 278/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3786
Epoch 279/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.4048
Epoch 280/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3924
Epoch 281/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3997
Epoch 282/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3811
Epoch 283/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3743
Epoch 284/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3934
Epoch 285/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4161
Epoch 286/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3804
Epoch 287/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3977
Epoch 288/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.3787
Epoch 289/500
2916/2916 [==============================] - 0s 157us/step - loss: 0.3840
Epoch 290/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3835
Epoch 291/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3723
Epoch 292/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.3801
Epoch 293/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3782
Epoch 294/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3863
Epoch 295/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.4148
Epoch 296/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.4066
Epoch 297/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3935
Epoch 298/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.3753
Epoch 299/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4057
Epoch 300/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3746
Epoch 301/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.3756
Epoch 302/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3813
Epoch 303/500
2916/2916 [==============================] - 0s 161us/step - loss: 0.4194
Epoch 304/500
2916/2916 [==============================] - 0s 149us/step - loss: 0.3784
Epoch 305/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.4832
Epoch 306/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.4317
Epoch 307/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3961
Epoch 308/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3848
Epoch 309/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3719
Epoch 310/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3775
Epoch 311/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3978
Epoch 312/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3813
Epoch 313/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3871
Epoch 314/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3752
Epoch 315/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.4123
Epoch 316/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3736
Epoch 317/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3949
Epoch 318/500
2916/2916 [==============================] - 0s 158us/step - loss: 0.3882
Epoch 319/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3828
Epoch 320/500
2916/2916 [==============================] - 1s 174us/step - loss: 0.3859
Epoch 321/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3952
Epoch 322/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3746
Epoch 323/500
2916/2916 [==============================] - 1s 180us/step - loss: 0.3848
Epoch 324/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3767
Epoch 325/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3975
Epoch 326/500
2916/2916 [==============================] - 0s 119us/step - loss: 0.3854
Epoch 327/500
2916/2916 [==============================] - 0s 141us/step - loss: 0.3663
Epoch 328/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.3764
Epoch 329/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.3738
Epoch 330/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3892
Epoch 331/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3766
Epoch 332/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3718
Epoch 333/500
2916/2916 [==============================] - 0s 171us/step - loss: 0.3731
Epoch 334/500
2916/2916 [==============================] - 0s 162us/step - loss: 0.3698
Epoch 335/500
2916/2916 [==============================] - 0s 161us/step - loss: 0.3813
Epoch 336/500
2916/2916 [==============================] - ETA: 0s - loss: 0.390 - 0s 133us/step - loss: 0.3923
Epoch 337/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3749 0s - loss:
Epoch 338/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3696
Epoch 339/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3766
Epoch 340/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3752
Epoch 341/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3944
Epoch 342/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.4030
Epoch 343/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3674
Epoch 344/500
2916/2916 [==============================] - 0s 120us/step - loss: 0.3894
Epoch 345/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.3797
Epoch 346/500
2916/2916 [==============================] - 0s 164us/step - loss: 0.4015
Epoch 347/500
2916/2916 [==============================] - 1s 177us/step - loss: 0.3828
Epoch 348/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3824
Epoch 349/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.3814
Epoch 350/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3843
Epoch 351/500
2916/2916 [==============================] - 0s 158us/step - loss: 0.3913
Epoch 352/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3980
Epoch 353/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3673
Epoch 354/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.3713
Epoch 355/500
2916/2916 [==============================] - 0s 146us/step - loss: 0.3920
Epoch 356/500
2916/2916 [==============================] - 0s 117us/step - loss: 0.3835
Epoch 357/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3656
Epoch 358/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3653
Epoch 359/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.4795
Epoch 360/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.4251
Epoch 361/500
2916/2916 [==============================] - 0s 151us/step - loss: 0.4129
Epoch 362/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.4182
Epoch 363/500
2916/2916 [==============================] - 0s 148us/step - loss: 0.3936
Epoch 364/500
2916/2916 [==============================] - 0s 145us/step - loss: 0.3788
Epoch 365/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3608
Epoch 366/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3729
Epoch 367/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3709
Epoch 368/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3548
Epoch 369/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3689
Epoch 370/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3772
Epoch 371/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3720
Epoch 372/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3790
Epoch 373/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3609 0s - loss: 
Epoch 374/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3800
Epoch 375/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3695
Epoch 376/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3816
Epoch 377/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.3897
Epoch 378/500
2916/2916 [==============================] - 0s 152us/step - loss: 0.3606
Epoch 379/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3796
Epoch 380/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3957
Epoch 381/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3732
Epoch 382/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3611
Epoch 383/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3711
Epoch 384/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3657
Epoch 385/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3768
Epoch 386/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.4062
Epoch 387/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3725
Epoch 388/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3718
Epoch 389/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3545
Epoch 390/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.3874
Epoch 391/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3755
Epoch 392/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.3918
Epoch 393/500
2916/2916 [==============================] - 0s 159us/step - loss: 0.3861
Epoch 394/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3922
Epoch 395/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3833 0s - loss: 0
Epoch 396/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3723
Epoch 397/500
2916/2916 [==============================] - 0s 134us/step - loss: 0.3834
Epoch 398/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3693
Epoch 399/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3593
Epoch 400/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3785
Epoch 401/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3730
Epoch 402/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3777
Epoch 403/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3625
Epoch 404/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3710
Epoch 405/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3609
Epoch 406/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3884
Epoch 407/500
2916/2916 [==============================] - 0s 158us/step - loss: 0.3688
Epoch 408/500
2916/2916 [==============================] - 0s 168us/step - loss: 0.3674
Epoch 409/500
2916/2916 [==============================] - 0s 144us/step - loss: 0.3824
Epoch 410/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.3781
Epoch 411/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3790
Epoch 412/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3633
Epoch 413/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3566
Epoch 414/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3670
Epoch 415/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3756
Epoch 416/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3804
Epoch 417/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3701
Epoch 418/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3719
Epoch 419/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3644
Epoch 420/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3767
Epoch 421/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3592
Epoch 422/500
2916/2916 [==============================] - 0s 156us/step - loss: 0.3874
Epoch 423/500
2916/2916 [==============================] - 1s 179us/step - loss: 0.3679
Epoch 424/500
2916/2916 [==============================] - 0s 153us/step - loss: 0.4397
Epoch 425/500
2916/2916 [==============================] - 0s 136us/step - loss: 0.3903
Epoch 426/500
2916/2916 [==============================] - 0s 121us/step - loss: 0.3825
Epoch 427/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3711
Epoch 428/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3658
Epoch 429/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3665
Epoch 430/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3636
Epoch 431/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3690
Epoch 432/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3607
Epoch 433/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3679
Epoch 434/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3631
Epoch 435/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3663
Epoch 436/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3656
Epoch 437/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.3887
Epoch 438/500
2916/2916 [==============================] - 0s 164us/step - loss: 0.3524
Epoch 439/500
2916/2916 [==============================] - 0s 160us/step - loss: 0.3651
Epoch 440/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3492
Epoch 441/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3711
Epoch 442/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3699
Epoch 443/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.4054
Epoch 444/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3598
Epoch 445/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3588
Epoch 446/500
2916/2916 [==============================] - 0s 131us/step - loss: 0.3724
Epoch 447/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3732
Epoch 448/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3822
Epoch 449/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3777
Epoch 450/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3721
Epoch 451/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3694
Epoch 452/500
2916/2916 [==============================] - 0s 133us/step - loss: 0.3770
Epoch 453/500
2916/2916 [==============================] - 0s 161us/step - loss: 0.3752
Epoch 454/500
2916/2916 [==============================] - 0s 171us/step - loss: 0.3832
Epoch 455/500
2916/2916 [==============================] - 0s 140us/step - loss: 0.3680
Epoch 456/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3728
Epoch 457/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3629
Epoch 458/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3686
Epoch 459/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3641
Epoch 460/500
2916/2916 [==============================] - 0s 129us/step - loss: 0.3649
Epoch 461/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3650
Epoch 462/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3680
Epoch 463/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3521
Epoch 464/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3571
Epoch 465/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3560
Epoch 466/500
2916/2916 [==============================] - 0s 132us/step - loss: 0.3795
Epoch 467/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3753
Epoch 468/500
2916/2916 [==============================] - 0s 159us/step - loss: 0.3769
Epoch 469/500
2916/2916 [==============================] - 1s 187us/step - loss: 0.3780
Epoch 470/500
2916/2916 [==============================] - 0s 167us/step - loss: 0.3709
Epoch 471/500
2916/2916 [==============================] - 0s 141us/step - loss: 0.3506
Epoch 472/500
2916/2916 [==============================] - 0s 143us/step - loss: 0.3519
Epoch 473/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3643
Epoch 474/500
2916/2916 [==============================] - 0s 149us/step - loss: 0.3679
Epoch 475/500
2916/2916 [==============================] - 0s 137us/step - loss: 0.3715 0s - loss: 0
Epoch 476/500
2916/2916 [==============================] - 0s 139us/step - loss: 0.3673
Epoch 477/500
2916/2916 [==============================] - 0s 142us/step - loss: 0.3650
Epoch 478/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3536
Epoch 479/500
2916/2916 [==============================] - 0s 128us/step - loss: 0.3734
Epoch 480/500
2916/2916 [==============================] - 0s 122us/step - loss: 0.3552
Epoch 481/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3789
Epoch 482/500
2916/2916 [==============================] - 0s 153us/step - loss: 0.3843
Epoch 483/500
2916/2916 [==============================] - 1s 173us/step - loss: 0.3730
Epoch 484/500
2916/2916 [==============================] - 0s 155us/step - loss: 0.3596
Epoch 485/500
2916/2916 [==============================] - 0s 159us/step - loss: 0.3631
Epoch 486/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3587
Epoch 487/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3710
Epoch 488/500
2916/2916 [==============================] - 0s 130us/step - loss: 0.3566
Epoch 489/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3668
Epoch 490/500
2916/2916 [==============================] - 0s 135us/step - loss: 0.3641
Epoch 491/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.3806
Epoch 492/500
2916/2916 [==============================] - 0s 124us/step - loss: 0.3689
Epoch 493/500
2916/2916 [==============================] - 0s 126us/step - loss: 0.3626
Epoch 494/500
2916/2916 [==============================] - 0s 123us/step - loss: 0.3623
Epoch 495/500
2916/2916 [==============================] - 0s 125us/step - loss: 0.4024
Epoch 496/500
2916/2916 [==============================] - 0s 127us/step - loss: 0.3924
Epoch 497/500
2916/2916 [==============================] - 0s 154us/step - loss: 0.3636
Epoch 498/500
2916/2916 [==============================] - 0s 168us/step - loss: 0.3647
Epoch 499/500
2916/2916 [==============================] - 0s 149us/step - loss: 0.3821
Epoch 500/500
2916/2916 [==============================] - 0s 138us/step - loss: 0.3760
Out[186]:
<keras.callbacks.callbacks.History at 0x23d38846e48>
In [187]:
pred = model.predict(X_test)
In [188]:
error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
print('RMSE is:', error)
RMSE is: 0.8706200673451744
In [189]:
r2_score(y_test, pred)
Out[189]:
0.309198496410218
In [ ]:
 

Future steps:

  • Optimization of the models can be improved
  • Classifying the IMDB_score into bin and using a classifier model can be more effective as the imdb score ranges from 0-10.
  • More data can be introduced for better results
In [ ]: